Human-Computer Interaction


REASONER: An Explainable Recommendation Dataset with Comprehensive Labeling Ground Truths, Lei Wang

Neural Information Processing Systems

Explainable recommendation has attracted much attention from both industry and academia, and it has shown great potential to improve recommendation persuasiveness, informativeness, and user satisfaction. While many promising explainable recommender models have been proposed in the past few years, the datasets used to evaluate them still suffer from several limitations: for example, the explanation ground truths are not labeled by real users, and the explanations are mostly single-modal and cover only one aspect. To bridge these gaps, we build a new explainable recommendation dataset which, to our knowledge, is the first to provide a large amount of real-user-labeled, multi-modal, and multi-aspect explanation ground truths. Specifically, we first develop a video recommendation platform on which a series of questions about recommendation explainability are carefully designed.
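
Purely as an illustration of what a multi-modal, multi-aspect explanation record in such a dataset could look like, here is a hypothetical Python sketch; the field names and values are invented for this example and are not taken from the released REASONER data.

```python
# Hypothetical sketch of a REASONER-style labeled record.
# Field names are illustrative only, not the released dataset schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExplanationRecord:
    user_id: str
    video_id: str
    rating: int                                                   # observed feedback on the recommendation
    tag_explanations: List[str] = field(default_factory=list)     # user-selected tags ("why I liked it")
    image_explanations: List[str] = field(default_factory=list)   # IDs of preview frames the user marked
    review_explanation: str = ""                                  # free-text reason written by the user

record = ExplanationRecord(
    user_id="u_001",
    video_id="v_042",
    rating=4,
    tag_explanations=["funny", "good soundtrack"],
    image_explanations=["frame_03", "frame_11"],
    review_explanation="The trailer music matched my taste.",
)
print(record)
```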


MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

Neural Information Processing Systems

Gesture synthesis is a vital area of human-computer interaction, with wide-ranging applications across fields such as film, robotics, and virtual reality. Recent advances have used diffusion models to improve gesture synthesis, but the high computational complexity of these techniques limits their practical application. In this study, we explore the potential of state space models (SSMs). Directly applying SSMs to gesture synthesis is difficult, primarily because of the diverse movement dynamics of different body parts, and the generated gestures can also exhibit unnatural jittering. To address these issues, we adopt a two-stage modeling strategy with discrete motion priors to enhance gesture quality. Built upon the selective scan mechanism, we introduce MambaTalk, which integrates hybrid fusion modules with local and global scans to refine latent space representations. Subjective and objective experiments demonstrate that our method surpasses state-of-the-art models.
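
To make the selective scan idea concrete, the following is a minimal NumPy sketch of a selective state-space recurrence in the spirit of Mamba-style blocks; the shapes, parameter names, and single-block setup are simplifications for illustration and do not reproduce the MambaTalk architecture.

```python
# Minimal selective state-space scan: the step size (delta) depends on the input,
# which is the "selective" ingredient of Mamba-style SSM blocks.
import numpy as np

rng = np.random.default_rng(0)

T, D, N = 16, 8, 4                               # time steps, feature dim, state dim per feature
x = rng.standard_normal((T, D))                  # e.g. a latent gesture sequence

A = -np.exp(rng.standard_normal((D, N)))         # negative real parts keep the dynamics stable
W_delta = rng.standard_normal((D,)) * 0.1        # produces the input-dependent step size
W_B = rng.standard_normal((D, N)) * 0.1
W_C = rng.standard_normal((D, N)) * 0.1

def softplus(z):
    return np.log1p(np.exp(z))

h = np.zeros((D, N))
ys = []
for t in range(T):
    delta = softplus(W_delta * x[t])[:, None]    # (D, 1), depends on the current input
    A_bar = np.exp(delta * A)                    # per-step discretization of A
    B_bar = delta * W_B                          # simple Euler-style discretization of B
    h = A_bar * h + B_bar * x[t][:, None]        # recurrent state update
    ys.append((h * W_C).sum(-1))                 # project the state back to features

y = np.stack(ys)                                 # (T, D) refined latent sequence
print(y.shape)
```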


EEVR: A Dataset of Paired Physiological Signals and Textual Descriptions for Joint Emotion Representation Learning

Neural Information Processing Systems

Figure 2 presents still images extracted from the 360° videos used in the experiment to display various environments to the participants. The videos were selected from the publicly available 360° VR video dataset of Li et al. (2017). The EEVR dataset comprises synchronized pairs of physiological signals and textual data. It includes responses to four self-assessment questions regarding perceived arousal, valence, dominance, and discrete emotion ratings collected using the PANAS questionnaire (which were further used to compute positive and negative affect scores). The EEVR dataset was collected using 360° virtual reality (VR) videos as the elicitation medium. The videos were selected based on their arousal and valence ratings to cover all four quadrants of the Russell circumplex model of emotion (Russell et al., 1989), as shown in Figure 2. The remainder of the supplementary material provides detailed information about the EEVR dataset, and Figure 3 provides a datasheet for EEVR based on Gebru et al. (2018).


EEVR: A Dataset of Paired Physiological Signals and Textual Descriptions for Joint Emotion Representation Learning

Neural Information Processing Systems

EEVR (Emotion Elicitation in Virtual Reality) is a novel dataset specifically designed for language-supervision-based pre-training for emotion recognition tasks, such as valence and arousal classification. It features high-quality physiological signals, including electrodermal activity (EDA) and photoplethysmography (PPG), acquired through emotion elicitation with 360-degree virtual reality (VR) videos. Additionally, it includes subject-wise textual descriptions of the emotions experienced during each stimulus, gathered from qualitative interviews. The dataset consists of recordings from 37 participants and is the first to pair raw text with physiological signals, providing additional contextual information that objective labels cannot offer. To leverage this dataset, we introduce the Contrastive Language Signal Pre-training (CLSP) method, which jointly learns representations from pairs of physiological signals and textual descriptions. Our results show that integrating self-reported textual descriptions with physiological signals significantly improves performance on emotion recognition tasks such as arousal and valence classification. Moreover, our pre-trained CLSP model demonstrates strong zero-shot transferability to existing datasets, outperforming supervised baseline models and suggesting that the representations learned by our method are more contextualized and generalizable. The dataset release also includes baseline models for arousal, valence, and emotion classification, as well as code for data cleaning and feature extraction.
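
As a rough sketch of what a contrastive language-signal objective can look like, the PyTorch snippet below pairs signal and text embeddings with a symmetric CLIP-style loss; the linear encoders, dimensions, and temperature are placeholders, not the authors' CLSP implementation.

```python
# CLIP-style symmetric contrastive loss between physiological-signal embeddings
# and text embeddings; matching pairs lie on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

batch, sig_dim, txt_dim, emb_dim = 32, 128, 768, 64
signal_features = torch.randn(batch, sig_dim)    # e.g. EDA/PPG features per segment
text_features = torch.randn(batch, txt_dim)      # e.g. sentence embeddings of self-reports

signal_proj = torch.nn.Linear(sig_dim, emb_dim)  # placeholder projection heads
text_proj = torch.nn.Linear(txt_dim, emb_dim)
temperature = 0.07

z_sig = F.normalize(signal_proj(signal_features), dim=-1)
z_txt = F.normalize(text_proj(text_features), dim=-1)

logits = z_sig @ z_txt.t() / temperature         # (batch, batch) similarity matrix
targets = torch.arange(batch)                    # index of the matching pair
loss = (F.cross_entropy(logits, targets) +       # signal -> text direction
        F.cross_entropy(logits.t(), targets)) / 2  # text -> signal direction
print(loss.item())
```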


LESS: Label-Efficient and Single-Stage Referring 3D Segmentation

Neural Information Processing Systems

Referring 3D segmentation is a vision-language task that segments all points of a specified object from a 3D point cloud, given a query sentence describing the object. Previous works follow a two-stage paradigm: first conducting language-agnostic instance segmentation, then matching the resulting instances with the given text query. However, in this paradigm the semantic concepts from the text query and the visual cues interact only separately during training, and both instance and semantic labels are required for each object, which is time-consuming and labor-intensive. To mitigate these issues, we propose LESS, a novel Label-Efficient and Single-Stage referring 3D segmentation pipeline that is supervised only by an efficient binary mask.
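
To illustrate how binary-mask supervision can drive a single-stage model, the following PyTorch sketch computes a per-point BCE plus Dice loss against a binary foreground mask; this is a common formulation chosen for illustration and is not necessarily the exact objective used by LESS.

```python
# Binary-mask supervision for referring 3D segmentation: only per-point
# foreground/background labels for the referred object are required,
# rather than full instance and semantic annotations.
import torch
import torch.nn.functional as F

num_points = 4096
point_logits = torch.randn(num_points, requires_grad=True)   # per-point scores from a single-stage model
binary_mask = (torch.rand(num_points) > 0.9).float()          # 1 = point belongs to the referred object

bce = F.binary_cross_entropy_with_logits(point_logits, binary_mask)

probs = torch.sigmoid(point_logits)
intersection = (probs * binary_mask).sum()
dice = 1 - (2 * intersection + 1) / (probs.sum() + binary_mask.sum() + 1)

loss = bce + dice
loss.backward()
print(loss.item())
```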


ProEdit: Simple Progression is All You Need for High-Quality 3D Scene Editing

Neural Information Processing Systems

This paper proposes ProEdit, a simple yet effective framework for high-quality 3D scene editing guided by diffusion distillation in a novel progressive manner. Inspired by the observation that multi-view inconsistency in scene editing is rooted in the diffusion model's large feasible output space (FOS), our framework controls the size of the FOS and reduces inconsistency by decomposing the overall editing task into several subtasks, which are then executed progressively on the scene.
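
As a schematic illustration of progressive task decomposition (not the paper's diffusion-distillation procedure), the Python sketch below splits an edit into K subtasks of increasing strength and optimizes each in turn, so every step only has to cover a small portion of the overall change.

```python
# Schematic progressive editing loop: each subtask targets a slightly stronger edit,
# keeping the per-step change (and hence the feasible output space) small.
import numpy as np

def distillation_edit_step(scene, target_strength, steps=10):
    """Stand-in stub: nudge the scene toward the (scalar) target edit strength."""
    for _ in range(steps):
        scene = scene + 0.5 * (target_strength - scene)
    return scene

scene = np.zeros(3)                 # placeholder scene parameters
num_subtasks = 4                    # K subtasks decompose the overall edit
for k in range(1, num_subtasks + 1):
    subtask_strength = k / num_subtasks          # progress from a mild edit to the full edit
    scene = distillation_edit_step(scene, subtask_strength)
    print(f"subtask {k}: scene = {scene.round(3)}")
```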




I wore Google's XR glasses, and they already beat my Ray-Ban Meta in 3 ways

ZDNet

Google unveiled a slew of new AI tools and features at I/O, dropping the term Gemini 95 times and AI 92 times. However, the best announcement of the entire show wasn't an AI feature; rather, the title went to one of the two hardware products announced -- the Android XR glasses. For the first time, Google gave the public a look at its long-awaited smart glasses, which pack Gemini's assistance, in-lens displays, speakers, cameras, and mics into the form factor of traditional eyeglasses. I had the opportunity to wear them for five minutes, during which I ran through a demo of using them to get visual Gemini assistance, take photos, and get navigation directions.


EyeGraph: Modularity-aware Spatio Temporal Graph Clustering for Continuous Event-based Eye Tracking

Neural Information Processing Systems

Continuous tracking of eye movement dynamics plays a significant role in developing a broad spectrum of human-centered applications, such as cognitive skills modeling, biometric user authentication, and foveated rendering. Recently, neuromorphic cameras have garnered significant interest in the eye-tracking research community, owing to their sub-microsecond latency in capturing intensity changes resulting from eye movements. Nevertheless, existing approaches for event-based eye tracking suffer from several limitations: dependence on RGB frames, label sparsity, and training on datasets collected in controlled lab environments that do not adequately reflect real-world scenarios. To address these limitations, we propose a dynamic graph-based approach that uses the event stream for high-fidelity tracking of pupillary movement. We first present EyeGraph, a large-scale, multi-modal near-eye tracking dataset collected from 40 participants using a wearable event camera attached to a head-mounted device; the dataset was curated to mimic in-the-wild settings, with variations in user movement and ambient lighting. Subsequently, to address label sparsity, we propose an unsupervised topology-aware spatio-temporal graph clustering approach as a benchmark. We show that our unsupervised approach achieves performance comparable to more onerous supervised approaches while consistently outperforming conventional clustering-based unsupervised approaches.
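
To give a flavor of modularity-based clustering on a spatio-temporal event graph, the sketch below builds a k-nearest-neighbour graph over synthetic (x, y, t) events and partitions it with networkx's greedy modularity communities; the graph construction and clustering choices are illustrative and are not the EyeGraph pipeline.

```python
# Modularity-based clustering of event-camera events by proximity in (x, y, t).
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# Synthetic events: two spatio-temporal blobs standing in for pupil-region activity.
events = np.vstack([
    rng.normal([20, 20, 0.1], [2, 2, 0.01], size=(200, 3)),
    rng.normal([60, 45, 0.3], [2, 2, 0.01], size=(200, 3)),
])
events[:, 2] *= 100                  # rescale time so it is comparable to pixel coordinates

# Build a k-nearest-neighbour graph over (x, y, scaled t).
tree = cKDTree(events)
_, neighbors = tree.query(events, k=6)
G = nx.Graph()
for i, row in enumerate(neighbors):
    for j in row[1:]:                # skip the point itself
        G.add_edge(i, int(j))

# Partition the graph by greedy modularity maximization.
communities = greedy_modularity_communities(G)
print(f"found {len(communities)} clusters; sizes: {[len(c) for c in communities[:5]]}")
```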